import pandas as pd
df = pd.read_csv('lightcast_job_postings.csv')
df.head()
(df.head() output, truncated: 5 rows × 131 columns. Columns run from ID, LAST_UPDATED_DATE, LAST_UPDATED_TIMESTAMP, DUPLICATES, POSTED, EXPIRED, DURATION, SOURCE_TYPES, SOURCES, URL, ... through the NAICS_2022_2 ... NAICS_2022_6_NAME industry-code columns; sample industries in the first five rows include Retail Trade, Finance and Insurance, Administrative and Support Services, and Unclassified Industry.)

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import plotly.express as px
df.info()
df.isna().sum()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72476 entries, 0 to 72475
Columns: 131 entries, ID to NAICS_2022_6_NAME
dtypes: bool(2), float64(11), int64(27), object(91)
memory usage: 71.5+ MB
ID                        0
LAST_UPDATED_DATE         0
LAST_UPDATED_TIMESTAMP    0
DUPLICATES                0
POSTED                    0
                         ..
NAICS_2022_4_NAME         0
NAICS_2022_5              0
NAICS_2022_5_NAME         0
NAICS_2022_6              0
NAICS_2022_6_NAME         0
Length: 131, dtype: int64
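With 131 columns, a quick missingness ranking helps decide which columns are usable as model features before committing to a feature list. A minimal sketch on a hypothetical mini-frame (`toy` and the 0.5 threshold are illustrative, not from the dataset):

```python
import numpy as np
import pandas as pd

# Sketch: rank columns by fraction missing before picking model features.
# 'toy' is a hypothetical stand-in for the 131-column frame.
toy = pd.DataFrame({"A": [1, 2, np.nan, 4],
                    "B": [np.nan, np.nan, np.nan, 1.0],
                    "C": [1, 2, 3, 4]})

miss = toy.isna().mean().sort_values(ascending=False)
keep = miss[miss <= 0.5].index.tolist()
print(miss.to_dict())  # {'B': 0.75, 'A': 0.25, 'C': 0.0}
print(keep)            # ['A', 'C'] -- B is 75% missing and gets dropped
```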
df['TITLE_NAME'] = df['TITLE_NAME'].astype(str)
# Note: bare substrings like 'ai' and 'ml' also match unrelated words
# (e.g. "maintenance", "html"), so this flag over-counts AI roles;
# word boundaries (r'\b...\b') would give a stricter pattern.
df['IS_AI_ROLE'] = df['TITLE_NAME'].str.lower().str.contains(
    'data|ai|machine learning|ml|artificial intelligence'
).astype(int)
df['IS_AI_ROLE'].value_counts()
IS_AI_ROLE
0    48310
1    24166
Name: count, dtype: int64
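The keyword filter above matches bare substrings, so the ~24k positives should be treated as an upper bound. This sketch, on hypothetical titles, shows how 'ai' and 'ml' over-match and how word boundaries tighten the pattern:

```python
import pandas as pd

# Hypothetical titles showing why bare 'ai'/'ml' substrings over-match.
titles = pd.Series(["Maintenance Technician", "HTML Developer",
                    "Machine Learning Engineer", "Data Analyst"])

loose = titles.str.lower().str.contains(
    "data|ai|machine learning|ml|artificial intelligence")
# A non-capturing group plus word boundaries keeps only whole-word hits.
strict = titles.str.lower().str.contains(
    r"\b(?:data|ai|ml|machine learning|artificial intelligence)\b")

print(loose.tolist())   # [True, True, True, True] -- "maintenance", "html" leak in
print(strict.tolist())  # [False, False, True, True]
```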
features = ['REMOTE_TYPE_NAME', 'EDUCATION_LEVELS_NAME', 'NAICS_2022_2_NAME', 'MAX_YEARS_EXPERIENCE']
df_model = df[features + ['IS_AI_ROLE']].dropna()
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

X = df_model[features]
y = df_model['IS_AI_ROLE']

encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
X_encoded = encoder.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42)
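Because the encoder above is fitted on the full dataset before splitting, a small amount of information about the test split (which category levels exist) leaks into training. A leakage-safe sketch fits the encoder inside a Pipeline so it only sees training rows; `toy` below is a tiny synthetic stand-in for df_model, with column names mirroring the notebook:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny synthetic stand-in for df_model (column names mirror the notebook).
toy = pd.DataFrame({
    "REMOTE_TYPE_NAME": ["Remote", "On-site"] * 10,
    "EDUCATION_LEVELS_NAME": ["Bachelor's", "Master's"] * 10,
    "NAICS_2022_2_NAME": ["Finance and Insurance", "Retail Trade"] * 10,
    "MAX_YEARS_EXPERIENCE": list(range(20)),
    "IS_AI_ROLE": [0, 1] * 10,
})

pipe = Pipeline([
    ("encode", ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"),
          ["REMOTE_TYPE_NAME", "EDUCATION_LEVELS_NAME", "NAICS_2022_2_NAME"])],
        remainder="passthrough")),           # numeric column passes through
    ("model", LogisticRegression(max_iter=1000)),
])

X_tr, X_te, y_tr, y_te = train_test_split(
    toy.drop(columns="IS_AI_ROLE"), toy["IS_AI_ROLE"],
    test_size=0.2, random_state=42)
pipe.fit(X_tr, y_tr)   # the encoder now sees only the training rows
print(pipe.score(X_te, y_te))
```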
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
[[865 175]
 [354 292]]
              precision    recall  f1-score   support

           0       0.71      0.83      0.77      1040
           1       0.63      0.45      0.52       646

    accuracy                           0.69      1686
   macro avg       0.67      0.64      0.65      1686
weighted avg       0.68      0.69      0.67      1686
import plotly.express as px

y_probs = clf.predict_proba(X_test)[:, 1]
fig = px.histogram(x=y_probs, nbins=50, title="Predicted Probability of AI Job", labels={'x': 'Probability'})
fig.show()

This histogram shows the predicted probabilities of jobs being AI-related, with most values falling between 0.2 and 0.6: the model rarely assigns high-confidence predictions, so it separates AI from non-AI roles only weakly.

import numpy as np

features_cat = ['REMOTE_TYPE_NAME', 'EDUCATION_LEVELS_NAME', 'NAICS_2022_2_NAME']
features_num = ['MAX_YEARS_EXPERIENCE', 'DURATION']

df_model = df[features_cat + features_num + ['IS_AI_ROLE']].dropna()

X_cat = df_model[features_cat]
X_num = df_model[features_num]
y = df_model['IS_AI_ROLE']

encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
X_cat_encoded = encoder.fit_transform(X_cat)

X_full = np.hstack((X_cat_encoded, X_num.values))
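The raw numeric columns stacked here (e.g. DURATION, in days) sit on a much larger scale than the 0/1 dummies, which can dominate the coefficients and slow logistic-regression convergence. One option is to standardize the numeric block before stacking; a small sketch with hypothetical arrays:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical arrays: two dummy columns plus two raw numeric columns.
X_cat_demo = np.array([[1., 0.], [0., 1.], [1., 0.], [0., 1.]])
X_num_demo = np.array([[3., 45.], [10., 120.], [5., 7.], [8., 60.]])

# Standardize the numeric block so it doesn't dwarf the 0/1 dummies.
X_num_scaled = StandardScaler().fit_transform(X_num_demo)
X_full_demo = np.hstack((X_cat_demo, X_num_scaled))

print(X_full_demo.shape)                   # (4, 4)
print(X_num_scaled.mean(axis=0).round(6))  # ~[0. 0.] after standardization
```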
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_full, y, test_size=0.2, random_state=42)
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
from sklearn.metrics import classification_report, confusion_matrix

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
[[498  97]
 [250 163]]
              precision    recall  f1-score   support

           0       0.67      0.84      0.74       595
           1       0.63      0.39      0.48       413

    accuracy                           0.66      1008
   macro avg       0.65      0.62      0.61      1008
weighted avg       0.65      0.66      0.64      1008
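One cheap lever against the low AI-role recall seen above is `class_weight='balanced'`, which reweights the loss toward the minority class. A sketch on synthetic imbalanced data (standing in for the real split; the recall gain here is illustrative, not a result from the postings data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (~30% positives) standing in for the real split.
X, y = make_classification(n_samples=2000, weights=[0.7, 0.3], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000,
                              class_weight="balanced").fit(X_tr, y_tr)

r_plain = recall_score(y_te, plain.predict(X_te))
r_weighted = recall_score(y_te, weighted.predict(X_te))
print(r_plain, r_weighted)  # minority-class recall typically rises with weighting
```

The trade-off is usually lower precision on the minority class, so it is worth re-checking the full classification report after reweighting.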
import plotly.express as px

y_probs = clf.predict_proba(X_test)[:, 1]
fig = px.histogram(x=y_probs, nbins=50, title="Predicted Probability of AI Job (Enhanced Features)", labels={'x': 'Probability'})
fig.show()

This histogram displays the predicted probabilities of jobs being AI-related using the enhanced feature set, with a concentration between roughly 0.3 and 0.6, which suggests the model still cannot confidently separate AI from non-AI positions.

df['SALARY_FROM'] = pd.to_numeric(df['SALARY_FROM'], errors='coerce')
df['SALARY_TO'] = pd.to_numeric(df['SALARY_TO'], errors='coerce')
df['AVG_SALARY'] = (df['SALARY_FROM'] + df['SALARY_TO']) / 2
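Averaging the two bounds with `+` propagates NaN whenever either bound is missing, silently dropping postings that report only one figure; a row-wise mean (pandas skips NaN by default) keeps them. Illustrated on a hypothetical mini-frame:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame: the third posting reports only SALARY_TO.
s = pd.DataFrame({"SALARY_FROM": [50000.0, 80000.0, np.nan],
                  "SALARY_TO":   [70000.0, np.nan, 90000.0]})

naive = (s["SALARY_FROM"] + s["SALARY_TO"]) / 2         # NaN if either is missing
rowmean = s[["SALARY_FROM", "SALARY_TO"]].mean(axis=1)  # skips NaN per row

print(naive.tolist())    # [60000.0, nan, nan]
print(rowmean.tolist())  # [60000.0, 80000.0, 90000.0]
```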


features_cat = ['REMOTE_TYPE_NAME', 'EDUCATION_LEVELS_NAME', 'NAICS_2022_2_NAME']
features_num = ['MAX_YEARS_EXPERIENCE', 'DURATION', 'IS_AI_ROLE']
target = 'AVG_SALARY'

df_reg = df[features_cat + features_num + [target]].dropna()
X_cat = df_reg[features_cat]
X_num = df_reg[features_num]
y = df_reg[target]

encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
X_cat_encoded = encoder.fit_transform(X_cat)

import numpy as np
X_full = np.hstack((X_cat_encoded, X_num.values))
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

X_train, X_test, y_train, y_test = train_test_split(X_full, y, test_size=0.2, random_state=42)

reg = LinearRegression()
reg.fit(X_train, y_train)


y_pred = reg.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.2f}")
print(f"R² Score: {r2:.3f}")
RMSE: 27763.52
R² Score: 0.409
import plotly.express as px
import pandas as pd

df_plot = pd.DataFrame({
    'Actual Salary': y_test,
    'Predicted Salary': y_pred
})

fig = px.scatter(df_plot, x='Actual Salary', y='Predicted Salary', trendline='ols',
                 title='Actual vs. Predicted Salary')
fig.show()

This scatter plot compares actual vs. predicted salaries from the regression model. The trendline indicates a positive correlation, but predictions are only moderately accurate (R² ≈ 0.41), and deviations widen noticeably at higher salary levels.

import statsmodels.api as sm

X_cat = df_reg[features_cat]
X_num = df_reg[features_num]
y = df_reg[target]


X_cat_encoded = encoder.fit_transform(X_cat)
X_full = np.hstack((X_cat_encoded, X_num.values))


X_full_const = sm.add_constant(X_full)


model = sm.OLS(y, X_full_const).fit()

print(model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             AVG_SALARY   R-squared:                       0.419
Model:                            OLS   Adj. R-squared:                  0.408
Method:                 Least Squares   F-statistic:                     38.26
Date:                Fri, 02 May 2025   Prob (F-statistic):          1.03e-233
Time:                        20:19:54   Log-Likelihood:                -27208.
No. Observations:                2325   AIC:                         5.450e+04
Df Residuals:                    2281   BIC:                         5.476e+04
Df Model:                          43                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       4.844e+04   3076.562     15.744      0.000    4.24e+04    5.45e+04
x1          6185.2580   3901.325      1.585      0.113   -1465.258    1.38e+04
x2          6240.6035   4529.491      1.378      0.168   -2641.748    1.51e+04
x3          1.964e+04   1930.887     10.172      0.000    1.59e+04    2.34e+04
x4          1.637e+04   1749.327      9.359      0.000    1.29e+04    1.98e+04
x5         -1.815e+04   8064.455     -2.250      0.025    -3.4e+04   -2334.413
x6         -7593.2178   6654.948     -1.141      0.254   -2.06e+04    5457.165
x7          2088.7084   1.29e+04      0.162      0.871   -2.32e+04    2.73e+04
x8         -5.321e+04   2.82e+04     -1.887      0.059   -1.08e+05    2089.448
x9          8243.6458   3043.062      2.709      0.007    2276.187    1.42e+04
x10         2.211e+04   3301.251      6.698      0.000    1.56e+04    2.86e+04
x11         2.888e+04   4903.334      5.889      0.000    1.93e+04    3.85e+04
x12         1.649e+04   9201.957      1.792      0.073   -1550.958    3.45e+04
x13        -2.381e+04   5411.642     -4.399      0.000   -3.44e+04   -1.32e+04
x14         1.659e+04   1.65e+04      1.008      0.314   -1.57e+04    4.89e+04
x15        -1.905e+04    1.3e+04     -1.468      0.142   -4.45e+04    6388.395
x16        -1.674e+04    2.2e+04     -0.759      0.448      -6e+04    2.65e+04
x17        -2.524e+04   4699.026     -5.371      0.000   -3.45e+04    -1.6e+04
x18         2895.4505   1.43e+04      0.203      0.839   -2.51e+04    3.09e+04
x19         2.791e+04   7087.566      3.938      0.000     1.4e+04    4.18e+04
x20         3.086e+04   1.19e+04      2.601      0.009    7594.241    5.41e+04
x21         1.571e+04   3258.754      4.822      0.000    9321.700    2.21e+04
x22         4.044e+04   1.64e+04      2.462      0.014    8226.015    7.26e+04
x23         5021.4975   7086.938      0.709      0.479   -8876.020    1.89e+04
x24         5513.9066   2422.785      2.276      0.023     762.815    1.03e+04
x25         -1.09e+04      2e+04     -0.545      0.586   -5.01e+04    2.83e+04
x26         8921.1183   8685.952      1.027      0.304   -8112.072     2.6e+04
x27         1.828e+04   4057.113      4.505      0.000    1.03e+04    2.62e+04
x28        -9030.8811   3559.594     -2.537      0.011    -1.6e+04   -2050.501
x29         7481.3685   2411.669      3.102      0.002    2752.074    1.22e+04
x30         5487.7656   3317.028      1.654      0.098   -1016.941     1.2e+04
x31        -2470.4504   2880.731     -0.858      0.391   -8119.578    3178.677
x32        -1674.6292   1.27e+04     -0.132      0.895   -2.66e+04    2.33e+04
x33         4946.7968   3778.383      1.309      0.191   -2462.628    1.24e+04
x34        -1069.9221   1.64e+04     -0.065      0.948   -3.31e+04     3.1e+04
x35        -4538.9700   6867.044     -0.661      0.509    -1.8e+04    8927.335
x36         8610.4137   2229.705      3.862      0.000    4237.952     1.3e+04
x37        -1.016e+04   9611.329     -1.058      0.290    -2.9e+04    8683.526
x38         1.702e+04   3814.113      4.463      0.000    9543.213    2.45e+04
x39         1567.8333   4262.420      0.368      0.713   -6790.791    9926.457
x40        -2361.9051   9143.105     -0.258      0.796   -2.03e+04    1.56e+04
x41         1504.9433   2431.035      0.619      0.536   -3262.327    6272.213
x42         6621.0132   6606.447      1.002      0.316   -6334.260    1.96e+04
x43         -326.6600   2612.269     -0.125      0.900   -5449.332    4796.012
x44         7778.2154    276.312     28.150      0.000    7236.366    8320.064
x45          -33.0613     44.169     -0.749      0.454    -119.677      53.555
x46        -8409.9107   1418.748     -5.928      0.000   -1.12e+04   -5627.739
==============================================================================
Omnibus:                     2668.528   Durbin-Watson:                   1.899
Prob(Omnibus):                  0.000   Jarque-Bera (JB):          1149034.332
Skew:                           5.268   Prob(JB):                         0.00
Kurtosis:                     111.397   Cond. No.                     1.16e+16
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.4e-26. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
df_cluster = df[['SALARY_FROM', 'SALARY_TO', 'MAX_YEARS_EXPERIENCE', 'DURATION', 'IS_AI_ROLE']].dropna()
df_cluster['AVG_SALARY'] = (df_cluster['SALARY_FROM'] + df_cluster['SALARY_TO']) / 2

X_cluster = df_cluster[['AVG_SALARY', 'MAX_YEARS_EXPERIENCE', 'DURATION', 'IS_AI_ROLE']]
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_cluster)
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=4, random_state=42)
df_cluster['Cluster'] = kmeans.fit_predict(X_scaled)
from sklearn.decomposition import PCA
import plotly.express as px

pca = PCA(n_components=2)
components = pca.fit_transform(X_scaled)

df_cluster['PC1'] = components[:, 0]
df_cluster['PC2'] = components[:, 1]

fig = px.scatter(
    df_cluster,
    x='PC1', y='PC2',
    color='Cluster',
    title="KMeans Clustering of Job Types",
    labels={'Cluster': 'Cluster ID'},
    opacity=0.7
)
fig.show()

This scatter plot visualizes the KMeans result in the PCA-reduced plane, showing the four clusters (k was fixed at 4) separated by salary, experience, posting duration, and the AI flag.

cluster_summary = df_cluster.groupby('Cluster')[['AVG_SALARY', 'MAX_YEARS_EXPERIENCE', 'DURATION', 'IS_AI_ROLE']].mean().round(1)
display(cluster_summary)
         AVG_SALARY  MAX_YEARS_EXPERIENCE  DURATION  IS_AI_ROLE
Cluster
0          104641.6                   4.0      47.8         0.5
1          149513.9                   6.9      20.9         0.2
2           86715.1                   2.7      19.7         1.0
3           92997.1                   2.4      17.7         0.0
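k = 4 was fixed by hand above; an inertia ("elbow") sweep is one common way to sanity-check that choice. A sketch on synthetic data standing in for X_scaled (shapes and values are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic 4-feature data standing in for X_scaled.
rng = np.random.default_rng(42)
X_demo = StandardScaler().fit_transform(rng.normal(size=(300, 4)))

inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=42)
                 .fit(X_demo).inertia_
            for k in range(2, 8)}
for k, v in inertias.items():
    print(k, round(v, 1))  # look for the 'elbow' where the drop flattens
```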

Regression model

1. Model goals and features. The goal of the regression model is to predict a posting's average salary (AVG_SALARY, the midpoint of SALARY_FROM and SALARY_TO) and to analyze the direction and strength of each feature's influence on it. The features actually used are: remote type, education level, 2-digit NAICS industry, maximum years of experience, posting duration, and the AI-role flag.

2. Model conclusions and insights. The model explains a moderate share of salary variance (R² ≈ 0.41, RMSE ≈ $27,764). Experience is the strongest positive driver: each additional year of maximum experience adds roughly $7,800 (p < 0.001), and several industry and education dummies are also significant. Notably, once industry, education, and experience are controlled for, the AI-role indicator carries a negative coefficient (about -$8,400, p < 0.001), so an "AI" keyword in the title does not by itself imply a pay premium in this data. The very large condition number in the OLS summary also warns of strong multicollinearity among the one-hot dummies, so individual coefficients should be read with caution.

3. Advice for job seekers. Because experience and industry dominate the salary signal, candidates seeking higher pay should weigh seniority and target high-paying industries rather than relying on an "AI" title alone. The clustering results point the same way: the highest-salary cluster (≈ $149.5k on average) combines the most experience with a low AI-role share.

Classification model

1. Model goals and features. The goal of the classification model is to predict from the input features whether a posting belongs to the AI field (IS_AI_ROLE, derived from keywords in the job title). The first model uses remote type, education level, 2-digit NAICS industry, and maximum years of experience; the second adds posting duration.

2. Model conclusions and insights. We built two classification models. The first reached 69% accuracy with a recall of 0.45 for AI roles; adding DURATION actually hurt performance (66% accuracy, recall 0.39), suggesting the extra feature added noise rather than signal. In both models the F1 score for AI roles stays at or below 0.52, and the class-imbalance problem persists.

3. Advice for job seekers. Job seekers interested in entering the AI field should use AI-relevant job titles on their resumes and focus on skills related to machine learning and cloud computing. Knowing which job titles and keywords correlate with AI postings also helps them screen listings more accurately and refine their job-search strategy.